
Conversation

@jeynmann (Contributor) commented Nov 6, 2025

What?

Support channel ID in UCX perftest.

Why?

Improve performance by distributing requests across queue pairs (QPs).

How?

Select the channel by thread ID modulo the number of channels (channel_id = threadIdx.x % num_channels).

Summary by CodeRabbit

  • New Features

    • Multi-channel device endpoint support for CUDA performance tests, allowing workloads to be distributed across configurable device channels for more accurate measurements.
  • Chores

    • New channel count parameter added with a default of 1 and initialization added so tests behave consistently when channels are unspecified.

@coderabbitai (bot) commented Nov 6, 2025

Walkthrough

Added a device endpoint channel count parameter and propagated it through perf test initialization into CUDA context and kernel code. CUDA runtime now computes a per-thread channel_id (threadIdx.x % num_channels) and uses it in device_put_* calls; device-side arrays are copied via a new device_vector helper.
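
A minimal sketch of the per-thread channel selection described above; the kernel and parameter struct names below are hypothetical stand-ins, and only the channel_id = threadIdx.x % num_channels pattern comes from the change itself:

    // Illustrative only: demonstrates the channel-selection pattern from the
    // walkthrough. demo_params and demo_send_kernel are made-up names, not
    // the actual perftest code.
    struct demo_params {
        unsigned num_channels; /* populated from device_ep_channel_count */
    };

    __global__ void demo_send_kernel(demo_params params)
    {
        /* Threads are spread round-robin across the configured device channels */
        unsigned channel_id = threadIdx.x % params.num_channels;

        /* The real kernels pass channel_id to device_put_single /
         * device_put_multi / device_put_multi_partial. */
        (void)channel_id;
    }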

Changes

Cohort / File(s) and summary:

  • Public config (src/tools/perf/api/libperf.h): Added unsigned device_ep_channel_count to ucx_perf_params_t.
  • Test initialization (src/tools/perf/perftest.c, src/tools/perf/perftest_params.c): Initialize device_ep_channel_count = 1 in perftest params and set it in adjust_test_params when device send is used.
  • CUDA context (src/tools/perf/cuda/cuda_kernel.cuh): Added unsigned num_channels to ucx_perf_cuda_context and populate it from perf.params.device_ep_channel_count in the constructor.
  • CUDA kernel & runtime (src/tools/perf/cuda/ucp_cuda_kernel.cu): Added num_channels to ucp_perf_cuda_params; initialize it from the perf params; compute channel_id = threadIdx.x % num_channels and pass it to device_put_single, device_put_multi, and device_put_multi_partial; added a device_vector<T> helper and use it to copy indices, offsets, and lengths to the device.
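
A rough sketch of what a device_vector<T> helper with the semantics described above might look like (allocate device memory, copy host data in, free in the destructor); this is an assumption-level illustration, not the actual UCX implementation:

    #include <cuda_runtime.h>
    #include <cstddef>

    // Illustrative helper only: allocate device memory, copy from host,
    // release on destruction. Error handling is omitted for brevity.
    template <typename T>
    class device_vector {
    public:
        device_vector(const T *host_data, size_t count) : m_ptr(nullptr)
        {
            cudaMalloc(&m_ptr, count * sizeof(T));
            cudaMemcpy(m_ptr, host_data, count * sizeof(T),
                       cudaMemcpyHostToDevice);
        }

        ~device_vector()
        {
            cudaFree(m_ptr);
        }

        T *get() const { return m_ptr; } /* device pointer passed to the kernel */

    private:
        T *m_ptr;
    };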

Sequence Diagram

sequenceDiagram
    participant Config as Config (libperf)
    participant Test as Test Init
    participant Host as Host runtime
    participant GPU as CUDA Kernel
    participant NIC as Device ops

    Config->>Test: device_ep_channel_count
    Test->>Host: populate params (num_channels)
    Host->>Host: device_vector(copy indices/offsets/lengths)
    Host->>GPU: launch kernel (params, device arrays)
    GPU->>GPU: channel_id = threadIdx.x % num_channels
    GPU->>NIC: device_put_* (with channel_id)
    NIC-->>GPU: completion

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review areas needing attention:
    • Correctness of channel_id computation and its usage across all device_put_* call sites.
    • Memory lifetime and error handling in device_vector<T> (allocation/copy/free).
    • Consistency of num_channels propagation between host structs and device-visible params.

Suggested reviewers

  • brminich
  • iyastreb
  • ofirfarjun7

Poem

🐰 New channels hum where threads once played,
I hop through loops where indices are laid.
From config seed to kernel's bright art,
Each thread finds a lane — a channeled heart. ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: you can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The title 'TOOLS/DEVICE: support channel id in perftest' directly and accurately summarizes the main change: adding channel ID support to the perftest tool.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84da1e1 and d4803a0.

📒 Files selected for processing (1)
  • src/tools/perf/cuda/ucp_cuda_kernel.cu (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-06T09:04:19.215Z
Learnt from: iyastreb
Repo: openucx/ucx PR: 10906
File: src/tools/perf/cuda/ucp_cuda_kernel.cu:70-91
Timestamp: 2025-11-06T09:04:19.215Z
Learning: In UCX device API (src/ucp/api/device/ucp_device_impl.h), nullptr is a valid and supported value for the ucp_device_request_t* parameter in functions like ucp_device_put_single, ucp_device_put_multi, etc. This is an intentional performance optimization where operations are posted without per-request tracking overhead. The API explicitly handles nullptr in ucp_device_request_init and UCP_DEVICE_SEND_BLOCKING macro.

Applied to files:

  • src/tools/perf/cuda/ucp_cuda_kernel.cu
🧬 Code graph analysis (1)
src/tools/perf/cuda/ucp_cuda_kernel.cu (1)
src/ucp/api/device/ucp_device_impl.h (2)
  • ucp_device_put_single (143-166)
  • ucp_device_put_multi (264-290)
🔇 Additional comments (3)
src/tools/perf/cuda/ucp_cuda_kernel.cu (3)

113-113: LGTM!

The addition of the num_channels field to the parameter structure is appropriate for supporting channel-based routing.


214-217: LGTM!

The device_vector helper correctly allocates device memory and copies host data, with proper cleanup in the destructor. The usage for transferring indices and offsets to the device is appropriate.

Also applies to: 229-237


126-126: No issues found—device_ep_channel_count is always initialized to 1.

Verification confirms all code paths set device_ep_channel_count = 1 before use (perftest.c:224 and perftest_params.c:760). The modulo operation at line 249 cannot cause division by zero with current code. Defensive validation is unnecessary unless the TODO in perftest_params.c (reading channel count from ucp config) is implemented in the future.
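
If that TODO is ever implemented, a defensive clamp along these lines would keep the modulo safe; this is a hypothetical helper, not something present in the PR:

    /* Hypothetical guard: clamp the configured channel count to at least 1 so
     * that threadIdx.x % num_channels can never divide by zero. */
    static unsigned sanitize_channel_count(unsigned requested)
    {
        return (requested == 0) ? 1 : requested;
    }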


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai (bot) left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87e88f7 and 7355ab4.

📒 Files selected for processing (9)
  • contrib/test_jenkins.sh (1 hunks)
  • contrib/ucx_perftest_config/test_types_ucp_device_cuda (2 hunks)
  • src/tools/perf/api/libperf.h (3 hunks)
  • src/tools/perf/cuda/cuda_kernel.cuh (4 hunks)
  • src/tools/perf/cuda/ucp_cuda_kernel.cu (7 hunks)
  • src/tools/perf/perftest.c (2 hunks)
  • src/tools/perf/perftest.h (1 hunks)
  • src/tools/perf/perftest_params.c (3 hunks)
  • test/gtest/common/test_perf.cc (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/tools/perf/perftest.c (2)
src/tools/perf/lib/ucp_tests.cc (4)
  • params (197-219)
  • params (197-199)
  • params (221-231)
  • params (221-221)
src/ucs/sys/sys.c (1)
  • ucs_get_page_size (530-543)
src/tools/perf/perftest_params.c (1)
src/tools/perf/lib/uct_tests.cc (2)
  • params (118-130)
  • params (118-118)
src/tools/perf/cuda/ucp_cuda_kernel.cu (1)
src/ucp/api/device/ucp_device_impl.h (5)
  • ucs_status_t (83-104)
  • ucp_device_progress_req (441-451)
  • ucp_device_put_single (143-166)
  • ucp_device_put_multi (264-290)
  • ucp_device_put_multi_partial (347-377)

Comment on lines 52 to 66
        if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) {
            ucx_perf_cuda_time_t cur_time  = ucx_perf_cuda_get_time_ns();
            ucx_perf_cuda_time_t iter_time = (cur_time - m_last_report_time) /
                                             (completed - m_ctx.completed_iters);
            m_last_report_time             = cur_time;
            m_ctx.completed_iters          = completed;
            __threadfence_system();

            m_next_report_iter = ucs_min(completed + (m_report_interval_ns / iter_time),
                                         m_max_iters);

⚠️ Potential issue | 🔴 Critical

Guard against zero deltas before dividing.

Here we divide by both (completed - m_ctx.completed_iters) and iter_time. If the progress callback fires twice with the same completed (e.g. because m_report_interval_ns / iter_time rounded to zero) or if the GPU clock hasn’t advanced yet, both denominators become zero and the SM traps with a divide-by-zero. Please bail out when delta == 0 and clamp iter_time to at least 1 before the final division.

@@
-        if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) {
-            ucx_perf_cuda_time_t cur_time  = ucx_perf_cuda_get_time_ns();
-            ucx_perf_cuda_time_t iter_time = (cur_time - m_last_report_time) /
-                                             (completed - m_ctx.completed_iters);
+        if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) {
+            ucx_perf_counter_t delta_iters = completed - m_ctx.completed_iters;
+            if (delta_iters == 0) {
+                return;
+            }
+
+            ucx_perf_cuda_time_t cur_time  = ucx_perf_cuda_get_time_ns();
+            ucx_perf_cuda_time_t delta_time = cur_time - m_last_report_time;
+            if (delta_time == 0) {
+                delta_time = 1;
+            }
+
+            ucx_perf_cuda_time_t iter_time = ucs_max<ucx_perf_cuda_time_t>(delta_time / delta_iters, 1);
             m_last_report_time             = cur_time;
             m_ctx.completed_iters          = completed;
             __threadfence_system();
 
-            m_next_report_iter = ucs_min(completed + (m_report_interval_ns / iter_time),
-                                         m_max_iters);
+            ucx_perf_cuda_time_t next_interval = m_report_interval_ns / iter_time;
+            if (next_interval == 0) {
+                next_interval = 1;
+            }
+            m_next_report_iter = ucs_min(completed + next_interval, m_max_iters);
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) {
-            ucx_perf_cuda_time_t cur_time  = ucx_perf_cuda_get_time_ns();
-            ucx_perf_cuda_time_t iter_time = (cur_time - m_last_report_time) /
-                                             (completed - m_ctx.completed_iters);
-            m_last_report_time             = cur_time;
-            m_ctx.completed_iters          = completed;
-            __threadfence_system();
-
-            m_next_report_iter = ucs_min(completed + (m_report_interval_ns / iter_time),
-                                         m_max_iters);
+        if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) {
+            ucx_perf_counter_t delta_iters = completed - m_ctx.completed_iters;
+            if (delta_iters == 0) {
+                return;
+            }
+
+            ucx_perf_cuda_time_t cur_time   = ucx_perf_cuda_get_time_ns();
+            ucx_perf_cuda_time_t delta_time = cur_time - m_last_report_time;
+            if (delta_time == 0) {
+                delta_time = 1;
+            }
+
+            ucx_perf_cuda_time_t iter_time = ucs_max<ucx_perf_cuda_time_t>(delta_time / delta_iters, 1);
+            m_last_report_time             = cur_time;
+            m_ctx.completed_iters          = completed;
+            __threadfence_system();
+
+            ucx_perf_cuda_time_t next_interval = m_report_interval_ns / iter_time;
+            if (next_interval == 0) {
+                next_interval = 1;
+            }
+            m_next_report_iter = ucs_min(completed + next_interval, m_max_iters);
+        }
🤖 Prompt for AI Agents
In src/tools/perf/cuda/cuda_kernel.cuh around lines 52 to 61, the code can
divide by zero when computing per-iteration times because both delta =
(completed - m_ctx.completed_iters) and iter_time can be zero; compute and store
delta first, if delta == 0 bail out/skip the reporting block to avoid dividing
by zero, then compute iter_time and clamp it to at least 1 (e.g., use a max with
1) before using it in any division or in calculating m_next_report_iter so no
division by zero occurs.

Comment on lines 83 to 89
        if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
            req   = nullptr;
            flags = static_cast<ucp_device_flags_t>(0);
        } else {
            req            = &m_requests[index];
            m_pending_map |= UCS_BIT(index);
        }

⚠️ Potential issue | 🔴 Critical

Do not hand out a null request under flow control

When get_request() returns req == nullptr, the very next call to ucp_perf_cuda_send_async() passes that pointer straight into ucp_device_put_single/multi/multi_partial, all of which dereference req unconditionally. With device_fc_window > 1, we therefore crash every time this branch fires. Please always return a valid ucp_device_request_t* here and keep the existing bitmap semantics for the FC “prefetch” case.

-        if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
-            req   = nullptr;
-            flags = static_cast<ucp_device_flags_t>(0);
-        } else {
-            req            = &m_requests[index];
-            m_pending_map |= UCS_BIT(index);
-        }
+        req = &m_requests[index];
+        if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
+            flags = static_cast<ucp_device_flags_t>(0);
+        } else {
+            m_pending_map |= UCS_BIT(index);
+        }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
-            req   = nullptr;
-            flags = static_cast<ucp_device_flags_t>(0);
-        } else {
-            req            = &m_requests[index];
-            m_pending_map |= UCS_BIT(index);
-        }
+        req = &m_requests[index];
+        if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
+            flags = static_cast<ucp_device_flags_t>(0);
+        } else {
+            m_pending_map |= UCS_BIT(index);
+        }

Comment on lines 728 to 731
    case 'F':
        return parse_int(opt_arg, &params->super.device_fc_window,
                         "device flow control window size", 1, INT_MAX);
    case 'y':

⚠️ Potential issue | 🔴 Critical

Fix the -F parser’s target type.

parse_int() expects an int*, but we are handing it an unsigned*. With -Werror (the default in CI) this is an incompatible pointer type error, so the whole build breaks. Please parse into a local int and assign it back once validation succeeds.

@@
-    case 'F':
-        return parse_int(opt_arg, &params->super.device_fc_window,
-                         "device flow control window size", 1, INT_MAX);
+    case 'F': {
+        int fc_window;
+
+        status = parse_int(opt_arg, &fc_window,
+                           "device flow control window size", 1, INT_MAX);
+        if (status == UCS_OK) {
+            params->super.device_fc_window = fc_window;
+        }
+        return status;
+    }
🤖 Prompt for AI Agents
In src/tools/perf/perftest_params.c around lines 728 to 731, the -F option
passes an unsigned int pointer to parse_int which expects an int*, causing an
incompatible pointer type error under -Werror; fix it by parsing into a local
int variable (call parse_int with &local_int and the same bounds), check
parse_int succeeded, then assign local_int to params->super.device_fc_window
(cast/convert to the unsigned field) after validation; ensure you preserve the
same bounds and error handling behavior.


    void init_params(const ucx_perf_context_t &perf)
    {
        m_params.num_channels = perf.params.device_ep_channel_count;
Contributor


Maybe just set it in ucp_perf_cuda_params_handler and remove this func.
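
A rough sketch of that suggestion, using stand-in struct names (the real code uses ucx_perf_context_t and ucp_perf_cuda_params):

    /* Hypothetical: assign num_channels where the device params are built,
     * so a separate init_params() helper becomes unnecessary. */
    struct demo_perf_params { unsigned device_ep_channel_count; };
    struct demo_cuda_params { unsigned num_channels; };

    static void demo_cuda_params_handler(const demo_perf_params &perf_params,
                                         demo_cuda_params &cuda_params)
    {
        cuda_params.num_channels = perf_params.device_ep_channel_count;
    }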

@jeynmann jeynmann changed the title [DNM][WIP] TOOLS/DEVICE: support channel id in perftest TOOLS/DEVICE: support channel id in perftest Nov 14, 2025